34 



Meta-analyses: Heterogeneity can be a 
good thing 



Nan M. Laird 

Department of Biostatistics 

Harvard School of Public Health, Boston, MA 



Meta-analysis seeks to summarize the results of a number of different studies 
on a common topic. It is widely used to address important and disparate
problems in public health and medicine. Heterogeneity in the results of different
studies is common. Sometimes perceived heterogeneity is a motivation for the 
use of meta-analysis in order to understand and reconcile differences. In other 
cases the presence of heterogeneity is regarded as a reason not to summarize 
results. An important role for meta-analysis is the determination of design 
and analysis factors that influence the outcome of studies. Here I review some 
of the controversies surrounding the use of meta-analysis in public health and 
my own experience in the field. 



34.1 Introduction 

Meta-analysis has become a household word in many scientific disciplines. 
The uses of meta-analysis vary considerably. It can be used to increase power, 
especially for secondary endpoints or when dealing with small effects, to recon- 
cile differences in multiple studies, to make inferences about a very particular 
treatment or intervention, to address more general issues, such as the
magnitude of the placebo effect, or to ask what design factors influence
the outcome of research. In some cases, a meta-analysis indicates substantial
heterogeneity in the outcomes of different studies. 

With my colleague, Rebecca DerSimonian, I wrote several articles on meta- 
analysis in the early 1980s, presenting a method for dealing with heterogeneity. 
In this paper, I provide the motivation for this work, describe advantages and
difficulties with the method, and discuss current trends in handling heterogeneity.






Meta-analyses and heterogeneity 



34.2 Early years of random effects for meta-analysis 

I first learned about meta-analysis in the mid-1970s while I was still a gradu- 
ate student. Meta-analysis was being used then and even earlier in the social 
sciences to summarize the effectiveness of treatments in psychotherapy, the ef- 
fects of class size on educational achievement, experimenter expectancy effects 
in behavioral research and results of other compelling social science research 
questions. 

In the early 1980s, Fred Mosteller introduced me to Warner Slack and Doug
Porter at Harvard Medical School who had done a meta-analysis on the effec- 
tiveness of coaching students for the Scholastic Aptitude Tests (SAT). Fred 
was very active in promoting the use of meta-analysis in the social sciences, 
and later the health and medical sciences. Using data that they collected on
sixteen studies evaluating coaching for verbal aptitude, and thirteen on math
aptitude, Slack and Porter concluded that coaching is effective in raising aptitude
scores, contradicting the principle that the SATs measure "innate" ability
(Slack and Porter, 1980). 

What was interesting about Slack and Porter's data was a striking relation- 
ship between the magnitude of the coaching effect and the degree of control 
for the coached group. Many studies evaluated only coached students and 
compared their before and after coaching scores with national norms provided 
by the ETS on the average gains achieved by repeat test takers. These stud- 
ies tended to show large gains for coaching. Other studies used convenience 
samples as comparison groups, and some studies employed either matching or 
randomization. This last group of studies showed much smaller gains. Fred's 
private comment on their analysis was "Of course coaching is effective; oth- 
erwise, we would all be out of business. The issue is what kind of evidence is 
there about the effect of coaching?" 

The science (or art) of meta-analysis was then in its infancy. Eugene Glass 
coined the phrase meta-analysis in 1976 to mean the statistical analysis of 
the findings of a collection of individual studies (Glass, 1976). Early papers in 
the field stressed the need to systematically report relevant details on study 
characteristics, not only about design of the studies, but characteristics of 
subjects, investigators, interventions, measures, study follow-up, etc. However,
a formal statistical framework for creating summaries that incorporated
heterogeneity was lacking. I was working with Rebecca DerSimonian on random
effects models at the time, and it seemed like a natural approach that could be 
used to examine heterogeneity in a meta-analysis. We published a follow-up
article to Slack and Porter in the Harvard Educational Review that introduced
a random effects approach for meta-analysis (DerSimonian and Laird, 1983). 
The approach followed that of Cochran who wrote about combining the effects 
of different experiments with measured outcomes (Cochran, 1954). Cochran 
introduced the idea that the observed effect of each study could be partitioned
into the sum of a "true" effect plus a sampling (or within-study) error. An estimate
of the within-study error should be available from each individual study,
and can be used to get at the distribution of "true" effects. Many of the social 
science meta-analyses neglected to identify within-study error, and in some 
cases, within-study error variances were not reported. Following Cochran, we 
proposed a method for estimating the mean of the "true" effects, as well as the 
variation in "true" effects across studies. We assumed that the observed study 
effect was the difference in two means on a measured scale and also assumed 
normality for the distribution of effects and error terms. However, normality
is not necessary for validity of the estimates.
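
The partition described above can be made concrete with a short sketch. Writing y_i for the observed effect of study i and v_i for its within-study variance, the method-of-moments calculation commonly associated with the DerSimonian and Laird approach estimates the between-study variance tau^2 from Cochran's weighted sum of squares Q, and then reweights each study by 1/(v_i + tau^2). The function and variable names below are illustrative assumptions rather than notation from the original papers, and the code is a sketch under those assumptions, not a definitive implementation.

```python
from math import sqrt


def dersimonian_laird(effects, variances):
    """Moment-based random-effects summary (illustrative sketch).

    effects   -- observed study effects y_i
    variances -- within-study variances v_i (assumed known and positive)
    Returns (mu_hat, se_mu, tau2).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Inverse-variance weights and the equal-effects weighted mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    y_equal = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the weighted mean.
    q = sum(wi * (yi - y_equal) ** 2 for wi, yi in zip(w, effects))

    # Moment estimate of the between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Reweight with total variance v_i + tau2 and summarize.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    mu_hat = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se_mu = sqrt(1.0 / sum_w_star)
    return mu_hat, se_mu, tau2
```

When the estimate of tau^2 is zero, the result reduces to the usual inverse-variance weighted mean; a positive tau^2 spreads the weights more evenly across studies and widens the standard error of the summary.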

The random effects method for meta-analysis is now widely used, but in- 
troduces stumbling blocks for some researchers who find the concept of a 
distribution of effects for treatments or interventions unpalatable. A major 
conceptual problem is imagining the studies in the analysis as a sample from 
a recognizable population. As discussed in Laird and Mosteller (1990), absence
of a sampling frame to draw a random sample is a ubiquitous problem
in scientific research in most fields, and so should not be considered as a spe- 
cial problem unique to meta-analysis. For example, most investigators treat 
patients enrolled in a study as a random sample from some population of pa- 
tients, or clinics in a study as a random sample from a population of clinics 
and they want to make inferences about the population and not the particu- 
lar set of patients or clinics. This criticism does not detract from the utility 
of the random effects method. If different research programs
all yielded similar results, there would not be great interest in a meta-analysis.
The principle behind a random effects approach is that a major purpose of 
meta-analysis is to quantify the variation in the results, as well as provide an 
overall mean summary. 

Using our methods to re-analyze Slack and Porter's results, we concluded 
that any effects of coaching were too small to be of practical importance (Der- 
Simonian and Laird, 1983). Although the paper attracted considerable media 
attention (articles about the paper were published in hundreds of US
newspapers), the number of citations in the scientific literature is comparatively
modest. 



34.3 Random effects and clinical trials 

In contrast to our paper on coaching, a later paper by DerSimonian and 
myself has been very highly cited, and led to the moniker "DerSimonian and 
Laird method" when referring to a random-effects meta-analysis (DerSimonian 
and Laird, 1986). This paper adapted the random effects model for meta- 
analysis of clinical trials; the basic idea of the approach is the same, but here 
the treatment effect was assumed to be the difference in Binomial cure rates
between a treated and control group. Taking the observed outcome to be a
difference in Binomial cure rates raised various additional complexities. The 
difference in cure rates is more relevant and interpretable in clinical trials, 
but statistical methods for combining a series of 2 x 2 tables usually focus on 
the odds ratio. A second issue is that the Binomial mean and the variance 
are functionally related. As a result, the estimate of the within-study variance 
(which was used to determine the weight assigned to each study) is correlated 
with the estimated study effect size. We ignored this problem, with the result 
that the method can be biased, especially with smaller samples, and better 
approaches are available (Emerson et al., 1996; Wang et al., 2010). The final 
issue is estimating and testing for heterogeneity among the results, and how 
choice of effect measure (rate difference, risk ratio, or odds ratio) can affect 
the results. Studies have shown that choice of effect measure has relatively 
little effect on assessment of heterogeneity (Berlin et al., 1989). 
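
To illustrate why the Binomial setting is awkward, the sketch below computes two of the effect measures mentioned above from a single 2 x 2 table, together with their usual large-sample variance estimates. The variance of the rate difference is built from the same observed rates as the effect itself, which is the functional relationship between mean and variance noted above. The function names and the choice to reject tables with empty cells are assumptions made for this example only.

```python
from math import log


def risk_difference(events_t, n_t, events_c, n_c):
    """Difference in observed rates (treated minus control) and its
    estimated variance. The variance depends on the observed rates."""
    p_t = events_t / n_t
    p_c = events_c / n_c
    rd = p_t - p_c
    var = p_t * (1.0 - p_t) / n_t + p_c * (1.0 - p_c) / n_c
    return rd, var


def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Log odds ratio and its usual large-sample variance estimate.
    Tables with an empty cell are rejected here for simplicity."""
    a, b = events_t, n_t - events_t
    c, d = events_c, n_c - events_c
    if min(a, b, c, d) <= 0:
        raise ValueError("empty cell: log odds ratio undefined in this sketch")
    return log((a * d) / (b * c)), 1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d
```

Two trials of the same size can therefore receive quite different inverse-variance weights on the rate-difference scale purely because their observed rates differ, for example rates near 50% versus rates near 5%.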

The clinical trials setting can be very different from data synthesis in the 
social sciences. The endpoint of interest may not correspond to the primary
endpoint of the trial and the number of studies can be much smaller. For 
example, many clinical trials may be designed to detect short term surrogate 
endpoints, but are under-powered to detect long term benefits or important 
side effects (Hine et al., 1989). In this setting a meta-analysis can be the best 
solution for inferences about long term or untoward side effects. Thus the 
primary purpose of the meta-analysis may be to look at secondary endpoints 
when the individual studies do not have sufficient power to detect the effects of 
secondary endpoints. Secondly, it is common to restrict meta-analyses to randomized
controlled trials (RCTs) and possibly also to trials using the same
treatment. This is in direct contrast to meta-analyses that seek to answer 
very broad questions; such analyses can include many types of primary stud- 
ies, studies with different outcome measures, different treatments, etc. In one 
of the earliest meta-analyses, Beecher (1955) sought to measure the "placebo" 
effect by combining data from 15 clinical studies of pain for a variety of dif- 
ferent indications treated by different placebo techniques. 

Yusuf et al. (1985) introduced a "fixed effects" method for combining the 
results of a series of controlled clinical trials. They proposed a fixed-effect 
approach to the analysis of all trials ever done, published or not, that pre- 
sumes we are only interested in the particular set of studies we have found 
in our search (which is in principle all studies ever done). In practice, statis- 
ticians are rarely interested in only the particular participants in our data 
collection efforts, but want findings that can be generalized to similar partic- 
ipants, whether they be clinics, hospitals, patients, investigators, etc. In fact, 
the term "fixed" effects is sometimes confused with the equal effects setting, 
where the statistical methods used implicitly assume that the "true" effects 
for each study are the same. Yusuf et al. (1985) may have partly contributed 
to this confusion by stating that their proposed estimate and standard error 
of the overall effect do not require equality of the effects, but then cautioning 
that the interpretation of the results is restricted to the case where the effects
are "approximately similar." As noted by Cochran, ignoring variation in the
effects of different studies generally gives smaller standard errors that do not 
account for this variance. 
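
The point about standard errors can be seen in a short derivation. The notation is an assumption introduced here for illustration: y_i is the observed effect in study i, v_i its within-study variance (treated as known), tau^2 the variance of the "true" effects across studies, and w_i = 1/v_i the weights used when the effects are assumed equal. Studies are taken to be independent.

```latex
\begin{align*}
y_i &= \theta_i + e_i, \qquad
  \operatorname{Var}(e_i) = v_i, \qquad
  \operatorname{Var}(\theta_i) = \tau^2, \qquad
  w_i = 1/v_i, \\
\hat{\mu} &= \frac{\sum_i w_i y_i}{\sum_i w_i}, \\
\operatorname{Var}(\hat{\mu})
  &= \frac{\sum_i w_i^2 \,(v_i + \tau^2)}{\bigl(\sum_i w_i\bigr)^2}
   = \frac{1}{\sum_i w_i}
     + \tau^2 \, \frac{\sum_i w_i^2}{\bigl(\sum_i w_i\bigr)^2}
   \;\ge\; \frac{1}{\sum_i w_i}.
\end{align*}
```

The reported variance under the equal-effects assumption is the first term only, so it understates the variability of the summary whenever tau^2 is positive.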



34.4 Meta-analysis in genetic epidemiology 

In recent years, meta-analysis has been increasingly used in a very different type of
study: large scale genetic epidemiology studies. Literally thousands of reports 
are published each year on associations between genetic variants and diseases. 
These reports may look at only a few genetic variants in specified genes, 
or they may test hundreds of thousands of variants in all the chromosomes 
in Genome Wide Association Scans (GWAS). These GWAS reports are less 
than ten years old, but are now the standard approach for gene "discovery"
in complex diseases. Because there are so many variants being tested, and 
because the effect sizes tend to be quite small, replication of positive findings in 
independent samples has been considered a requirement for publication right 
from the beginning. But gradually there has been a shift from reporting the 
results of the primary study and a small but reasonable number of replications, 
to pooling the new study and the replication studies, and some or all available 
GWAS studies using the same endpoint. 

In contrast to how the term meta-analysis is used elsewhere, the term 
meta-analysis is often used in the genetic epidemiology literature to describe 
what is typically called "pooling" in the statistical literature, that is, analyz- 
ing individual level data together as a single study, stratifying or adjusting 
by source. The pooling approach is popular and usually viewed as more de- 
sirable, despite the fact that studies (Lin and Zeng, 2010) have shown that 
combining the summary statistics in a meta-analysis is basically equivalent 
to pooling under standard assumptions. I personally prefer meta-analysis be- 
cause it enables us to better account for heterogeneity and inflated variance 
estimates. 

Like all epidemiological studies, GWAS are influenced by many design
factors which will affect the results, and they have a few special features
which bear on the use of pooling or meta-analysis, especially in the context
of heterogeneity. First, the cost of genotyping is great, and to make these
studies affordable, samples have been largely opportunistic. Especially early 
on, most GWAS used pre-existing samples where sufficient biological material 
was available for genotyping, and the traits of interest were already available. 
This can cause considerable heterogeneity. For example, I was involved in 
the first GWAS to find association between a genetic variant and obesity as 
measured by body mass index (BMI); it illustrates the importance of study 
design. 






The original GWAS (with only 100,000 genetic markers) was carried out 
in the Framingham Heart Study (Herbert et al., 2006). We used a novel approach
to the analysis and had data from five other cohorts for replication.
All but one of the five cohorts reproduced the result. This was an easy study 
to replicate, because virtually every epidemiological study of disease has also 
measured height and weight, so that BMI is available, even though the study 
was not necessarily designed to look at factors influencing BMI. In addition, 
replication required genotyping only one genetic marker. In short order, hun- 
dreds of reports appeared in the literature, many of them non-replications. 
A few of us undertook a meta-analysis of these reported replications or non- 
replications with the explicit hypothesis that the study population influences 
the results; over 76,000 subjects were included in this analysis. We considered 
three broad types of study populations. The first is general population cohorts,
where subjects were drawn from general population samples without restricting
participants on the basis of health characteristics. Examples of this are the
Framingham Heart Study and a German population based sample (KORA). 
The second is healthy population samples where subjects are drawn from pop- 
ulations known to be healthier than those in the general population, typically 
from some specific work force. The third category of studies included those 
specifically designed to study obesity; those studies used subjects chosen on 
the basis of obesity, including case-control samples where obese and non-obese 
subjects were selected to participate and family-controlled studies that used 
only obese subjects and their relatives. 

In agreement with our hypothesis, the strongest result we found was that 
the effect varied by study population. The general population samples and 
the selected samples replicated the original study in finding a significant as- 
sociation, but the healthy population studies showed no evidence of an effect. 
This is a critically important finding. Many fields have shown that randomized
versus non-randomized, blinded versus unblinded, etc., can have major
effects, but this finding is a bit different. Using healthy subjects is not in- 
trinsically a poor design choice, but may be so for many common, complex 
disorders. One obvious reason is that genetic studies of healthy subjects may 
lack sufficient variation in outcome to have much power. More subtle factors 
might include environmental or other genetic characteristics which interact 
to modify the gene effect being investigated. In any event, it underscores the 
desirability of assessing and accounting for heterogeneity in meta-analyses of 
genetic associations. 

A second issue is related to the fact that there are still relatively few 
GWAS of specific diseases, so many meta-analyses involve only a handful of 
studies. The random effects approach of DerSimonian and Laird does not work 
well with only a handful of studies because it estimates a variance, where 
the sample size for the variance estimate is the number of studies. Finally 
for a GWAS, where hundreds of thousands of genetic markers are tested, 
often on thousands of subjects, any meta-analysis method needs to be easily 
implemented in order to be practically useful. Software for meta-analysis is 



N.M. Laird 



387 



included in the major genetic analysis statistical packages (Evangelou and 
loannidis, 2013), but most software packages only implement a fixed effects 
approach. As a result, the standard meta-analyses of GWAS use the fixed 
effects approach and potentially overstate the precision in the presence of 
heterogeneity. 



34.5 Conclusions 

The DerSimonian and Laird method has weathered a great deal of criticism,
and undoubtedly we need better methods for random effects analyses, especially
when the endpoints of interest are proportions and when the number
of studies being combined is small and/or the sample sizes within each study
are small. Most meta-analyses involving clinical trials acknowledge the impor- 
tance of assessing variation in study effects, and new methods for quantifying 
this variation are widely used (Higgins and Thompson, 2002). In addition, 
meta-regression methods for identifying factors influencing heterogeneity are 
available (Berkey et al., 1995; Thompson and Higgins, 2002); these can be used 
to form subsets of studies which are more homogeneous. There is an extensive 
literature emphasizing the necessity and desirability of assessing heterogene- 
ity, and many of these reinforce the role of study design in connection with 
heterogeneity. The use of meta-analysis in genetic epidemiology to find disease 
genes is still relatively new, but the benefits are widely recognized (Ioannidis
et al., 2007). Better methods for implementing random effects methods with 
a small number of studies will be especially useful here. 
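
As an illustration of the kind of quantification referred to above, one widely used summary from Higgins and Thompson (2002) is I^2, the proportion of the total variation in study estimates attributed to heterogeneity rather than sampling error, computed from Cochran's Q and its degrees of freedom. The sketch below assumes the observed effects and within-study variances are available; the function names are hypothetical.

```python
def cochran_q(effects, variances):
    """Cochran's Q: inverse-variance weighted sum of squared deviations
    of the study effects from their weighted mean."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, effects))


def i_squared(effects, variances):
    """I^2 = max(0, (Q - df) / Q) with df = k - 1, returned as a fraction."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    q = cochran_q(effects, variances)
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - (k - 1)) / q)
```

With only a handful of studies, Q is itself estimated imprecisely, so I^2 inherits the same small-sample difficulty noted above for the between-study variance.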



References 

Beecher, H.K. (1955). The powerful placebo. Journal of the American Med- 
ical Association, 159:1602-1606. 

Berkey, C.S., Hoaglin, D.C., Mosteller, F., and Colditz, G.A. (1995). A 
random-effects regression model for meta-analysis. Statistics in Medicine, 
14:395-411. 

Berlin, J.A., Laird, N.M., Sacks, H.S., and Chalmers, T.C. (1989). A comparison
of statistical methods for combining event rates from clinical trials.
Statistics in Medicine, 8:141-151.

Cochran, W.G. (1954). The combination of estimates from different experi- 
ments. Biometrics, 10:101-129. 






DerSimonian, R. and Laird, N.M. (1983). Evaluating the effect of coaching 
on SAT scores: A meta-analysis. Harvard Educational Review, 53:1-15. 

DerSimonian, R. and Laird, N.M. (1986). Meta-analysis in clinical trials. 
Controlled Clinical Trials, 7:177-188. 

Emerson, J.D., Hoaglin, D.C., and Mosteller, F. (1996). Simple robust pro- 
cedures for combining risk differences in sets of 2 x 2 tables. Statistics in 
Medicine, 15:1465-1488. 

Evangelou, E. and Ioannidis, J.P.A. (2013). Meta-analysis methods for
genome-wide association studies and beyond. Nature Reviews Genetics, 
14:379-389. 

Glass, G.V. (1976). Primary, secondary, and meta-analysis of research. Edu- 
cational Researcher, 5:3-8. 

Herbert, A., Gerry, N.P., McQueen, M.B., Heid, I.M., Pfeufer, A., Illig, T.,
Wichmann, H.-E., Meitinger, T., Hunter, D., Hu, F.B. et al. (2006). A 
common genetic variant is associated with adult and childhood obesity. 
Science, 312:279-283. 

Higgins, J. and Thompson, S.G. (2002). Quantifying heterogeneity in a meta- 
analysis. Statistics in Medicine, 21:1539-1558. 

Hine, L., Laird, N.M., Hewitt, P., and Chalmers, T. (1989). Meta-analytic 
evidence against prophylactic use of lidocaine in acute myocardial infarc- 
tion. Archives of Internal Medicine, 149:2694-2698. 

Ioannidis, J.P.A., Patsopoulos, N.A., and Evangelou, E. (2007). Heterogeneity
in meta-analyses of genome-wide association investigations. PLoS ONE,
2:e841.

Laird, N.M. and Mosteller, F. (1990). Some statistical methods for combining
experimental results. International Journal of Technology Assessment
in Health Care, 6:5-30.

Lin, D. and Zeng, D. (2010). Meta-analysis of genome-wide association stud- 
ies: No efficiency gain in using individual participant data. Genetic Epi- 
demiology, 34:60-66. 

Slack, W. and Porter, D. (1980). The scholastic aptitude test: A critical 
appraisal. Harvard Educational Review, 66:1-27. 

Thompson, S.G. and Higgins, J. (2002). How should meta-regression analyses
be undertaken and interpreted? Statistics in Medicine, 21:1559-1573. 

Wang, R., Tian, L., Cai, T., and Wei, L. (2010). Nonparametric inference pro- 
cedure for percentiles of the random effects distribution in meta-analysis. 
The Annals of Applied Statistics, 4:520-532. 






Yusuf, S., Peto, R., Lewis, J., Collins, R., and Sleight, P. (1985). Beta block- 
ade during and after myocardial infarction: an overview of the randomized 
trials. Progress in Cardiovascular Diseases, 27:335-371. 



